GH-47338: [C++][Python] Remove deprecated string-based Parquet encryption methods#47339
GH-47338: [C++][Python] Remove deprecated string-based Parquet encryption methods#47339pitrou merged 7 commits intoapache:mainfrom
Conversation
|
|
|
@pitrou Here is the follow-up on migrating Parquet encryption to There is no change to the Python API and only minimal glue-code for Python KMS client to work with the new C++ KMS client API. Rolling secure strings out to PyArrow is out of scope of this PR. |
| cdef extern from "Python.h": | ||
| # To let us get a PyObject* and avoid Cython auto-ref-counting | ||
| PyObject* PyBytes_FromStringAndSizeNative" PyBytes_FromStringAndSize"( | ||
| char *v, Py_ssize_t len) except NULL |
There was a problem hiding this comment.
I'm a bit surprised that this declaration is required. Cython already provides a declaration for this function and it would be surprised if it didn't have the right ref-counting behavior.
(also, by the way, if we require Cython 3.1 we will gain implicit std::string_view conversion to/from bytes objects)
There was a problem hiding this comment.
Importing PyBytes_FromStringAndSize via from cpython.bytes cimport PyBytes_FromStringAndSize works.
I have copied this from io.pxi, maybe it is not needed there as well / any more:
Lines 43 to 46 in cdc1459
| cdef: | ||
| cpp_string_view view = key.as_view() | ||
| key_bytes = PyObject_to_object( | ||
| PyBytes_FromStringAndSizeNative(view.data(), view.size())) |
There was a problem hiding this comment.
Of course this means the bytes object payload won't be cleared securely like the SecureString is.
We could try to tackle this as a followup issue/PR. bytes objects being immutable, this might be a bit delicate...
There was a problem hiding this comment.
Ok, even the cryptography package doesn't do anything about it, so we may just have to live with it:
https://cryptography.io/en/latest/limitations/#secure-memory-wiping
However,
cryptographydoes not clear memory by default, as there is no way to clear immutable structures such as bytes. As a result,cryptography, like almost all software in Python is potentially vulnerable to this attack. The CERT secure coding guidelines assesses this issue as “Severity: medium, Likelihood: unlikely, Remediation Cost: expensive to repair” and we do not consider this a high risk for most users.
There was a problem hiding this comment.
Yes, I deliberately left this piece of work for a separate PR.
Happy to see this as "Likelihood: unlikely, Remediation Cost: expensive to repair" as well.
|
|
||
| cdef: | ||
| c_string cstr = tobytes(key) | ||
| shared_ptr[CSecureString] css = shared_ptr[CSecureString](new CSecureString(move(cstr))) |
There was a problem hiding this comment.
It'd be nice to find if modern Cython allows us to make use of make_shared. At least it has a declaration for it:
https://github.com/cython/cython/blob/b862800bda6cc5f540d3b7961350cd2b6ad00dbd/Cython/Includes/libcpp/memory.pxd#L106-L107
There was a problem hiding this comment.
Works, this is much nicer, thanks!
| cdef void _cb_unwrap_key( | ||
| handler, const c_string& wrapped_key, | ||
| const c_string& master_key_identifier, c_string* out) except *: | ||
| const c_string& master_key_identifier, shared_ptr[CSecureString]* out) except *: |
There was a problem hiding this comment.
I wonder why this is expecting a shared_ptr output, is it because of Cython's limited support for move-only types?
There was a problem hiding this comment.
I made it work with CSecureString* out and CSecureString& out. What do you prefer?
| key = handler.unwrap_key(wk_str, mkid_str) | ||
| out[0] = tobytes(key) | ||
|
|
||
| cdef: |
There was a problem hiding this comment.
I wonder if a cdef: block is really required, especially as this is already a cdef function. Cython nowadays is more flexible than it used to be.
There was a problem hiding this comment.
Cython doesn't like the definition of a C_string:
Error compiling Cython file:
------------------------------------------------------------
...
cdef void _cb_unwrap_key(
handler, const c_string& wrapped_key,
const c_string& master_key_identifier, CSecureString& out) except *:
c_string cstr
^
------------------------------------------------------------
pyarrow/_parquet_encryption.pyx:323:13: Syntax error in simple statement list
Without the cdef of c_string cstr, it fails with
Error compiling Cython file:
------------------------------------------------------------
...
const c_string& master_key_identifier, CSecureString& out) except *:
mkid_str = frombytes(master_key_identifier)
wk_str = frombytes(wrapped_key)
key = handler.unwrap_key(wk_str, mkid_str)
cstr = tobytes(key)
out = CSecureString(move(cstr))
^
------------------------------------------------------------
pyarrow/_parquet_encryption.pyx:327:23: no suitable method found
There was a problem hiding this comment.
You are right, cdefs can be removed by using proper casting / c-type annotation: 79cd06e
|
Apart from the comments above, these changes look ok. |
|
All comments addressed. Diff is much cleaner now. |
|
Hmm, can you rebase/merge from main to fix most of the macOS CI builds? |
pitrou
left a comment
There was a problem hiding this comment.
LGTM for the most part, just one request
79cd06e to
fb5a9de
Compare
|
Comments addressed |
fb5a9de to
b125031
Compare
|
I rebased on main to get up-to-date CI results. |
|
Thanks a lot again @EnricoMi :) |
|
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 403ba70. There weren't enough matching historic benchmark results to make a call on whether there were regressions. The full Conbench report has more details. |
…encryption methods (apache#47339) ### Rationale for this change Passing encryption keys to Parquet via strings is insecure. Users are encouraged to use the `SecureString` based methods introduced in apache#46017 instead. To enforce this migration, deprecated methods are removed. ### What changes are included in this PR? Remove deprecated string-based methods. ### Are these changes tested? Existing Parquet encryption tests. ### Are there any user-facing changes? C++ user code that passes encryption keys to Parquet is affected and has to be adjusted. Python user code is not affected. **This PR includes breaking changes to public APIs.** * GitHub Issue: apache#47338 Authored-by: Enrico Minack <github@enrico.minack.dev> Signed-off-by: Antoine Pitrou <antoine@python.org>
Rationale for this change
Passing encryption keys to Parquet via strings is insecure. Users are encouraged to use the
SecureStringbased methods introduced in #46017 instead. To enforce this migration, deprecated methods are removed.What changes are included in this PR?
Remove deprecated string-based methods.
Are these changes tested?
Existing Parquet encryption tests.
Are there any user-facing changes?
C++ user code that passes encryption keys to Parquet is affected and has to be adjusted. Python user code is not affected.
This PR includes breaking changes to public APIs.